The present document is divided in 3 parts:

Download and Import the Dataset

Download

The dataset used for this assignment contains data of several hundreds wine reviews, with information about the country of production and the unit cost. The dataset contains both quantitative and qualitative variables.

The dataset was downloaded from the website kaggle.

Import

In order to import the dataset in RStudio, I used the read.csv function. Then, I renamed it “DATA”.

#load the dataset
read.csv("winemag-data_first150k.csv", header = T, sep = ",")
#Name the dataset
DATA <- read.csv("winemag-data_first150k.csv", header = T, sep = ",")

Descriptive Statistics

The first step to analyze this dataset is to show its structure and give some high-level information regarding it. Using the dim function we know that the dataset contains 150,930 observations and 11 variables. The ls function shows the column’s headers. The head function allows to explore the first 6 rows of the dataset. Finally the str function gives more info for each variable.

#more info about the dataset
dim(DATA)
## [1] 150930     11
ls(DATA)
##  [1] "country"     "description" "designation" "points"      "price"      
##  [6] "province"    "region_1"    "region_2"    "variety"     "winery"     
## [11] "X"
head(DATA)
str(DATA)
## 'data.frame':    150930 obs. of  11 variables:
##  $ X          : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ country    : chr  "US" "Spain" "US" "US" ...
##  $ description: chr  "This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak. Juicy red-cherry f"| __truncated__ "Ripe aromas of fig, blackberry and cassis are softened and sweetened by a slathering of oaky chocolate and vani"| __truncated__ "Mac Watson honors the memory of a wine once made by his mother in this tremendously delicious, balanced and com"| __truncated__ "This spent 20 months in 30% new French oak, and incorporates fruit from Ponzi's Aurora, Abetina and Madrona vin"| __truncated__ ...
##  $ designation: chr  "Martha's Vineyard" "Carodorum Selección Especial Reserva" "Special Selected Late Harvest" "Reserve" ...
##  $ points     : int  96 96 96 96 95 95 95 95 95 95 ...
##  $ price      : num  235 110 90 65 66 73 65 110 65 60 ...
##  $ province   : chr  "California" "Northern Spain" "California" "Oregon" ...
##  $ region_1   : chr  "Napa Valley" "Toro" "Knights Valley" "Willamette Valley" ...
##  $ region_2   : chr  "Napa" "" "Sonoma" "Willamette Valley" ...
##  $ variety    : chr  "Cabernet Sauvignon" "Tinta de Toro" "Sauvignon Blanc" "Pinot Noir" ...
##  $ winery     : chr  "Heitz" "Bodega Carmen Rodríguez" "Macauley" "Ponzi" ...

Because I run some statistics, I wanted to verify whether and how many null values the dataset contains.

#control of NA values
any(is.na(DATA))
## [1] TRUE
sum(is.na(DATA))
## [1] 13695
colSums(is.na(DATA))
##           X     country description designation      points       price 
##           0           0           0           0           0       13695 
##    province    region_1    region_2     variety      winery 
##           0           0           0           0           0

From this control, I realized that the dataset contains 13,695 null values and that all of them are in the Price column. Knowing this, I decided to work with a version of the dataset that does not contain rows with null values. I called this smaller version of it “DATA.clean”. This modified dataset contains 137,235 rows, 13,695 rows less than the original dataset. I could have used a function that allowed me to exclude only the rows that contain null values for the Price variable, but in this particular case all the null values are on the Price column, so it makes no difference.

DATA.clean <- na.omit(DATA)
any(is.na(DATA.clean))
## [1] FALSE

The next step is to run some summary statistics of the main variables. For example I found that the average price for a wine bottle is 33.13 USD, the cheapest wine is priced at 4 USD, the most expensive bottle costs 2,300 USD, and the quantiles are respectively 4, 16, 24, 40, and 2,300 USD.

#summary of price
summary(DATA.clean$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   16.00   24.00   33.13   40.00 2300.00
median(DATA.clean$price)
## [1] 24
quantile(DATA.clean$price)
##   0%  25%  50%  75% 100% 
##    4   16   24   40 2300
min(DATA.clean$price)
## [1] 4
max(DATA.clean$price)
## [1] 2300
#summary of location
summary(DATA.clean$country)
##    Length     Class      Mode 
##    137235 character character
#summary of review points
summary(DATA.clean$points)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   80.00   86.00   88.00   87.79   90.00  100.00
#summary of variety
summary(DATA.clean$variety)
##    Length     Class      Mode 
##    137235 character character

Then, I focused on the Country variable and I created a table that shows the distribution of the reviewed wines between the countries.

#focus on country variable
table(DATA.clean$country)
## 
##                                       Albania              Argentina 
##                      5                      2                   5587 
##              Australia                Austria Bosnia and Herzegovina 
##                   4894                   2483                      4 
##                 Brazil               Bulgaria                 Canada 
##                     25                     77                    194 
##                  Chile                  China                Croatia 
##                   5766                      3                     83 
##                 Cyprus         Czech Republic                England 
##                     31                      6                      8 
##                 France                Georgia                Germany 
##                  14785                     43                   2347 
##                 Greece                Hungary                  India 
##                    872                    230                      8 
##                 Israel                  Italy                  Japan 
##                    610                  18784                      2 
##                Lebanon              Lithuania             Luxembourg 
##                     37                      8                      9 
##              Macedonia                 Mexico                Moldova 
##                     16                     63                     71 
##             Montenegro                Morocco            New Zealand 
##                      2                     12                   3070 
##               Portugal                Romania                 Serbia 
##                   4176                    139                     14 
##               Slovakia               Slovenia           South Africa 
##                      3                     81                   2237 
##            South Korea                  Spain            Switzerland 
##                      4                   8160                      4 
##                 Turkey                Ukraine                Uruguay 
##                     50                      5                     85 
##                     US              US-France 
##                  62139                      1
producers <- table(DATA.clean$country)

table(DATA.clean$country)/nrow(DATA.clean)
## 
##                                       Albania              Argentina 
##           3.643385e-05           1.457354e-05           4.071119e-02 
##              Australia                Austria Bosnia and Herzegovina 
##           3.566146e-02           1.809305e-02           2.914708e-05 
##                 Brazil               Bulgaria                 Canada 
##           1.821693e-04           5.610814e-04           1.413634e-03 
##                  Chile                  China                Croatia 
##           4.201552e-02           2.186031e-05           6.048020e-04 
##                 Cyprus         Czech Republic                England 
##           2.258899e-04           4.372063e-05           5.829417e-05 
##                 France                Georgia                Germany 
##           1.077349e-01           3.133311e-04           1.710205e-02 
##                 Greece                Hungary                  India 
##           6.354064e-03           1.675957e-03           5.829417e-05 
##                 Israel                  Italy                  Japan 
##           4.444930e-03           1.368747e-01           1.457354e-05 
##                Lebanon              Lithuania             Luxembourg 
##           2.696105e-04           5.829417e-05           6.558094e-05 
##              Macedonia                 Mexico                Moldova 
##           1.165883e-04           4.590666e-04           5.173607e-04 
##             Montenegro                Morocco            New Zealand 
##           1.457354e-05           8.744125e-05           2.237039e-02 
##               Portugal                Romania                 Serbia 
##           3.042956e-02           1.012861e-03           1.020148e-04 
##               Slovakia               Slovenia           South Africa 
##           2.186031e-05           5.902284e-04           1.630051e-02 
##            South Korea                  Spain            Switzerland 
##           2.914708e-05           5.946005e-02           2.914708e-05 
##                 Turkey                Ukraine                Uruguay 
##           3.643385e-04           3.643385e-05           6.193755e-04 
##                     US              US-France 
##           4.527927e-01           7.286771e-06
perc.producers <- table(DATA.clean$country)/nrow(DATA.clean)

Transformation and Plot

Transformation

I decided to transform the variable Points using log10 function.

summary(DATA.clean$points)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   80.00   86.00   88.00   87.79   90.00  100.00
summary(log10(DATA.clean$points))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.903   1.934   1.944   1.943   1.954   2.000

Plot

For the final step, I plotted a bar chart of the countries distribution to see which countries are the most important producers of wine among the sample contained in the dataset.